Thesis

Ranking rules: consensus

Scoring ranking rules

Plurality

k-rules

Borda count

Distances

Categorical features

Numerical features

Mixed features

Algorithms

dknn

Sets of distances

rknn

Datasets

List

Binary

Categorical attributes

Less than 10 attributes

Breast Cancer

This is one of three datasets provided by the Oncology Institute that has repeatedly appeared in the machine learning literature.

This data set includes 201 instances of one class and 85 instances of another class. The instances are described by 9 attributes. In this version of the dataset all the attributes are nominal.

Description of the attributes:

Cars

The Car Evaluation Database was derived from a simple hierarchical decision model. The attributes are: buying (buying price), maint (price of the maintenance), doors (number of doors), persons (capacity in terms of persons to carry), lug_boot (the size of the luggage boot), safety (estimated safety of the car), and class. The class is the car acceptability, with possible values unacc, acc, good, and vgood.

Description of the attributes:

Somerville

The skin dataset was collected by randomly sampling B, G, R values from face images of various age groups (young, middle, and old), race groups (white, black, and asian), and genders, obtained from the FERET database (Color FERET Image Database) and the PAL database (PAL Face Database from the Productive Aging Laboratory, The University of Texas at Dallas). The total learning sample size is 245057, of which 50859 are skin samples and 194198 are non-skin samples. The dataset has dimension 245057 x 4: the first three columns are the B, G, R values (features x1, x2, and x3) and the fourth column holds the class labels (decision variable y).

Description of the attributes:

Tic-Tac-Toe

This database encodes the complete set of possible board configurations at the end of tic-tac-toe games, where “x” is assumed to have played first. The target concept is “win for x” (i.e., true when “x” has one of 8 possible ways to create a “three-in-a-row”).
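The target concept can be checked mechanically. The following is a minimal Python sketch, assuming a board is given as a sequence of nine symbols read row by row, with "x", "o", and "b" for a blank square; the name `x_wins` and this encoding are assumptions made for illustration.

```python
# The 8 winning lines on a 3x3 board, squares indexed 0..8 row by row.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def x_wins(board):
    """Return True when "x" occupies all three squares of some winning line."""
    return any(all(board[i] == "x" for i in line) for line in WIN_LINES)
```

For instance, `x_wins("xxxoobbbb")` is true because "x" fills the top row.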

Description of the attributes:

10 or more attributes

Mixed: categorical and numerical attributes

Less than 10 attributes

Cesarean

Mammographic masses

10 or more attributes

Travel insurance
  • Source: Kaggle
  • Number of rows: 18219
  • Number of attributes: 10

Description of the attributes:

Numeric attributes

Less than 10 attributes

Banknote authentication

Data were extracted from images taken of genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. The final images have 400 x 400 pixels. Due to the object lens and the distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were obtained. A Wavelet Transform tool was used to extract features from the images.

Description of the attributes:

  • variance of Wavelet Transformed image (continuous)
  • skewness of Wavelet Transformed image (continuous)
  • kurtosis of Wavelet Transformed image (continuous)
  • entropy of image (continuous)
  • class

Haberman

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Description of the attributes:

  • Age of patient at time of operation (numerical)
  • Patient’s year of operation (year - 1900, numerical)
  • Number of positive axillary nodes detected (numerical)
  • Survival status (class attribute)
    • 1 = the patient survived 5 years or longer
    • 2 = the patient died within 5 years

Skin segmentation

10 or more attributes


Multiclass

Categorical attributes

Less than 10 attributes

Balance Scale

This data set was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. The attributes are the left weight, the left distance, the right weight, and the right distance. The class is found by comparing \((left\_distance * left\_weight)\) and \((right\_distance * right\_weight)\): the scale tips to the side with the greater product. If they are equal, it is balanced.
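The rule above fits in a few lines. This is a minimal Python sketch; the function name `balance_class` and the class codes "L", "B", and "R" are assumptions made for illustration.

```python
def balance_class(left_weight, left_distance, right_weight, right_distance):
    """Classify a balance-scale example by comparing the two products."""
    left = left_distance * left_weight
    right = right_distance * right_weight
    if left > right:
        return "L"  # tips to the left
    if right > left:
        return "R"  # tips to the right
    return "B"      # balanced
```

For example, a left weight of 2 at distance 3 against a right weight of 3 at distance 2 gives 6 on both sides, so the example is balanced.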

Description of the attributes:

Chess

Post operative data

10 or more attributes

Poker hand

Mixed: categorical and numerical attributes

Less than 10 attributes

Teaching-assistant

Abalone

Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope – a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.

From the original data, examples with missing values were removed (the majority having the predicted value missing), and the ranges of the continuous values were scaled for use with an ANN (by dividing by 200).

Description of the attributes:

Life expectancy

10 or more attributes

Numeric attributes

Less than 10 attributes

Life expectancy

Seeds

10 or more attributes


Notes

Changes to the original datasets if any:

Examples

The mini_iris dataset

Manhattan distance with mini_iris

Euclidean distance with mini_iris

Train dknn k = 3, distance = euclidean, ties = randomly
nrow(train) = 9 and nrow(test) = 3


Matrix of distances:

test    X1         X2         X3         X5         X6         X7         X9         X10        X11
X4      2.144761   1.627882   0.7141428  4.5607017  3.9648455  3.8923001  4.160529   6.522270   4.2731721
X8      3.491418   3.295451   3.6687873  0.9643651  0.4358899  0.3464102  1.104536   2.917190   0.7937254
X12     4.713809   4.570558   4.9909919  0.8831761  1.2083046  1.2369317  1.195826   1.838478   0.8366600

Ranking for each instance:

test\train  X1   X2   X3   X5   X6   X7   X9   X10  X11
X4          3    2    1    8    5    4    6    9    7
X8          8    7    9    4    2    1    5    6    3
X12         8    7    9    2    4    5    3    6    1

Predict... Choosing a label [method =  randomly , k =  3] for the instance X4:

    setosa     setosa     setosa versicolor versicolor versicolor  virginica  virginica  virginica
         3          2          1          8          5          4          6          9          7

setosa > setosa > setosa > versicolor > versicolor > virginica > virginica > versicolor > virginica


--> Selected values:
setosa setosa setosa

--> Probabilities:
    setosa versicolor  virginica 
         1          0          0 

The label for this instance is: setosa 
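The prediction for X4 can be retraced from its row of the distance matrix. The following is a minimal Python sketch, not the R code that produced the log; the helper names `rank_distances` and `vote` are hypothetical, and it assumes all distances in a row are distinct, so the "ties = randomly" option is not reproduced.

```python
from collections import Counter


def rank_distances(distances):
    """Rank training instances by distance; rank 1 is the closest.

    Assumes all distances are distinct (no tie handling).
    """
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    ranks = [0] * len(distances)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def vote(ranks, labels, k):
    """Share of each label among the k best-ranked training instances."""
    nearest = [label for rank, label in zip(ranks, labels) if rank <= k]
    counts = Counter(nearest)
    return {label: counts[label] / k for label in sorted(set(labels))}


# Training labels in column order X1, X2, X3, X5, X6, X7, X9, X10, X11.
labels = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3

# Euclidean distances from test instance X4 to each training instance.
x4 = [2.144761, 1.627882, 0.7141428, 4.5607017, 3.9648455,
      3.8923001, 4.160529, 6.522270, 4.2731721]

ranks_x4 = rank_distances(x4)           # [3, 2, 1, 8, 5, 4, 6, 9, 7]
probs_x4 = vote(ranks_x4, labels, k=3)  # setosa takes all three votes
predicted_x4 = max(probs_x4, key=probs_x4.get)
```

The ranks agree with the X4 row of the ranking table, and the three closest instances are all setosa.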

Predict... Choosing a label [method =  randomly , k =  3 ] for the instance with ranking: 
     
    setosa     setosa     setosa versicolor versicolor versicolor  virginica  virginica  virginica
         8          7          9          4          2          1          5          6          3

versicolor > versicolor > virginica > versicolor > virginica > virginica > setosa > setosa > setosa

--> Selected values:
versicolor versicolor virginica 

--> Probabilities:
    setosa versicolor  virginica 
 0.0000000  0.6666667  0.3333333 

The label for this instance is: versicolor 

Predict... Choosing a label [method =  randomly , k =  3 ] for the instance with ranking: 

    setosa     setosa     setosa versicolor versicolor versicolor  virginica  virginica  virginica
         8          7          9          2          4          5          3          6          1

virginica > versicolor > virginica > versicolor > versicolor > virginica > setosa > setosa > setosa 

--> Sure values:
versicolor  virginica 
         2          1 

--> Tied values:
virginica 
        3 

Solving the ties... randomly

--> Number of elements to select randomly: 1 

--> Selected values:
[1] versicolor virginica  virginica 
Levels: setosa versicolor virginica
times
    setosa versicolor  virginica 
 0.0000000  0.3333333  0.6666667 

The label for this instance is: virginica 
[1] "setosa"     "versicolor" "virginica" 

> sink("iris_manhattan_randomly")

Chebyshev distance with mini_iris

mini_zoo

Noelia Rico

2019-06-25